Prepare the risk variable data for clustering by:
Converting categorical variables to factors for proper handling
Removing non-clustering variables (demographics, IDs) while preserving participant IDs as row names for later identification
Separating continuous and categorical variables for differential processing
Defining a flexible scaling function that supports multiple normalization methods (z-score, min-max, percentile, max-absolute, and robust scaling), with the method chosen according to which performed best in earlier testing steps
Applying the chosen scaling method (here, z-score) to standardize continuous variables while leaving categorical variables unchanged
## Data Prep ##
#1. Ensure categorical variables are encoded as factors for clustering
risk_variable_data <- risk_variable_data %>%
mutate(across(c("family_history_depression", "family_history_mania", "bullying"), as.factor))
#2. Retain subject IDs as row names while removing variables that will not be clustered
#2.1 Remove all variables except subject ID from the dataframe for clustering
risk_variable_clustering_data <- risk_variable_data %>%
dplyr::select(-session_id, -family_id, -site, -sex, -age, -race_ethnicity)
#2.2 Set subject IDs as row names
row.names(risk_variable_clustering_data) <- risk_variable_clustering_data$participant_id
#2.3 Remove subject ID column
risk_variable_clustering_data <- risk_variable_clustering_data %>%
dplyr::select(-participant_id)
#3. Identify continuous and categorical variables
continuous_vars <- names(risk_variable_clustering_data)[sapply(risk_variable_clustering_data, is.numeric)]
categorical_vars <- names(risk_variable_clustering_data)[sapply(risk_variable_clustering_data, is.factor)]
#4. Define a scaling function dispatched on the scaling-method parameter
apply_scaling <- function(df, method) {
out <- df
if (method != "none") {
cont <- df[continuous_vars]
scaled <- switch(method,
z_score = scale(cont),
min_max = purrr::map_df(cont, ~ (.x - min(.x, na.rm=TRUE)) / (max(.x, na.rm=TRUE) - min(.x, na.rm=TRUE))),
percentile = purrr::map_df(cont, ~ rank(.x, na.last="keep") / sum(!is.na(.x)) * 100),
max_absolute = purrr::map_df(cont, ~ .x / max(abs(.x), na.rm=TRUE)),
robust = purrr::map_df(cont, ~ {
  iqr <- IQR(.x, na.rm = TRUE)
  # `%||%` only guards NULL; a zero IQR needs an explicit fallback
  (.x - median(.x, na.rm = TRUE)) / (if (iqr == 0) 1 else iqr)
})
)
out[continuous_vars] <- scaled
}
out
}
#5. Scale data according to param set in script call
scaled_data <- apply_scaling(risk_variable_clustering_data, params$scaling_method)
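As a quick sanity check of the formulas inside apply_scaling(), the same transformations can be applied by hand to a toy vector (base R only; illustrative values, not study data):

```r
# Toy vector to spot-check each scaling formula (illustrative, not study data)
x <- c(2, 4, 6, 8, 10)

z  <- (x - mean(x)) / sd(x)              # z-score: mean 0, sd 1
mm <- (x - min(x)) / (max(x) - min(x))   # min-max: mapped onto [0, 1]
ma <- x / max(abs(x))                    # max-absolute: mapped onto [-1, 1]
rb <- (x - median(x)) / IQR(x)           # robust: centered on median, scaled by IQR

round(mm, 2)  # 0.00 0.25 0.50 0.75 1.00
round(ma, 1)  # 0.2 0.4 0.6 0.8 1.0
```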
Choosing the appropriate number of clusters (k) is critical for uncovering meaningful groupings in mixed-type data. In this analysis, we combine descriptive elbow analysis with (ideally multiple) internal validation indices to arrive at a robust, transparent recommendation for k between 2 and 8:
Elbow method We plot the within-cluster sum of squares (WSS) across k = 2:8 and apply the Kneedle algorithm (Satopaa et al., 2011) via the inflection package (Emekes, 2017) to quantitatively identify the elbow point. This provides a visual and numeric guide, but is treated as descriptive rather than prescriptive for the purposes of our analyses
Internal validation indices Using validation_kproto(), we compute the silhouette validation index, and loop in any of the other relevant validation indices (C-index, Dunn, Gamma, Point-biserial, and Tau) that completed within the maximum allowed 7-day run-time for each k
Consensus evaluation We record the optimal k returned by each index and examine patterns of agreement or discrepancy. Convergence of multiple indices on the same k will strengthen our confidence in that choice, though a nuanced examination of the index scores, their distribution, and the elbow method will be consulted to inform the final decision
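The consensus step reduces to a simple majority vote over each index's preferred k. A minimal sketch with hypothetical per-index results (the index names and votes below are illustrative only):

```r
# Hypothetical per-index optimal k values (illustrative only)
k_opt_per_index <- c(silhouette = 2, cindex = 2, gamma = 3)

# Majority vote: the most frequently chosen k wins (ties resolve to the first mode)
consensus_k <- as.integer(names(which.max(table(k_opt_per_index))))
consensus_k  # 2
```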
In this step we:
Compute k-prototypes clusterings for k = 2:8 once and cache the results
Extract within-cluster sum of squares (WSS) for elbow diagnostics
Run the silhouette index with one SLURM task per k value
Log start/end times, WSS values, and per-index k opt for reproducibility
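The per-k parallelization follows the standard future/furrr pattern sketched below, a self-contained toy (the worker count and the squaring function are illustrative; the real script maps kproto() over k = 2:8):

```r
library(furrr)

# Launch two background R sessions for parallel evaluation (illustrative worker count)
plan(multisession, workers = 2)

# Map a toy function across inputs in parallel, with reproducible RNG per element
squares <- future_map(1:4, ~ .x^2, .options = furrr_options(seed = TRUE))
unlist(squares)  # 1 4 9 16
```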
The silhouette-index job has now finished and written:
A single cached k-prototype list (k = 2:8)
WSS values for each k
A validation index object (val_z_score_silhouette.rds) with optimal k for the silhouette
Next, we will:
Read and combine all WSS into an elbow plot and run Kneedle
Merge each available index's full k-vs-score trajectory
Compute each available index's preferred k and select a consensus k
Visualize results via line plots and heatmaps for final inspection
## Determining Optimal Number of Clusters (k Value) ##
#1. Generate clustering solutions & validation indices of interest
#1.1 Establish relevant parameters & paths
#1.1.1 Create a vector containing the k's to test
k_vals <- 2:8
#1.1.2 List where to stash / load the kproto list for this method
proto_file <- file.path(kproto_dir, sprintf("kproto_%s.rds", scaling_method))
#1.2 Load or compute all k-prototypes clusterings
if (file.exists(proto_file)) {
#1.2.1 Skip clustering if already done
kp_list <- readRDS(proto_file)
cat(
sprintf("%s|PROTO_SKIP|%s|loaded %d ks from cache\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
scaling_method, length(k_vals)),
file = log_file, append = TRUE
)
} else {
#1.2.2 Compute clusters in parallel across k if not completed
proto_start <- Sys.time()
cat(
sprintf("%s|PROTO_START|%s\n",
format(proto_start, "%Y-%m-%d %H:%M:%OS3"),
scaling_method),
file = log_file, append = TRUE
)
#1.2.3 Store a list containing each kp object (parallel over 7 k's)
kp_list <- future_map(
k_vals,
~ kproto(x = scaled_data,
k = .x,
nstart = 8,
verbose = TRUE),
.options = furrr_options(seed = TRUE)
)
#1.2.4 Name each list element with a "k" prefix (k2 ... k8)
names(kp_list) <- paste0("k", k_vals)
#1.2.5 Save the kproto list
saveRDS(kp_list, proto_file)
#1.2.6 Log end status for clustering
proto_end <- Sys.time()
cat(
sprintf("%s|PROTO_END|%s|duration=%.1fmin\n", format(proto_end, "%Y-%m-%d %H:%M:%OS3"),
scaling_method, as.numeric(difftime(proto_end, proto_start, units = "mins"))),
file = log_file, append = TRUE
)
}
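Stripped of the logging, the load-or-compute branch above is a generic cache-to-RDS pattern. A minimal sketch (toy computation and throwaway temp file stand in for the kproto fits and proto_file):

```r
# Generic compute-once / load-thereafter caching pattern (toy stand-in)
cache_path <- tempfile(fileext = ".rds")

get_result <- function() {
  if (file.exists(cache_path)) {
    readRDS(cache_path)            # cheap path: load the cached object
  } else {
    result <- sum(1:10)            # stand-in for the expensive computation
    saveRDS(result, cache_path)    # persist for subsequent runs
    result
  }
}

first  <- get_result()   # computes and caches
second <- get_result()   # loads from cache
c(first, second)  # 55 55
```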
#1.2.7 Extract total within-cluster sum of squares (WSS) for elbow plot
wss_vec <- map_dbl(kp_list, "tot.withinss")
names(wss_vec) <- k_vals
#1.2.8 Log WSS for each k
for (j in seq_along(k_vals)) {
cat(
sprintf("%s|WSS|%s|k=%d|wss=%.1f\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
scaling_method, k_vals[j], wss_vec[j]),
file = log_file, append = TRUE
)
}
#1.3 Read in all available validation objects (silhouette always there; others if finished)
#1.3.1 Find every val_<method>_<index>.rds for this scaling method
val_files <- list.files(
validation_dir,
pattern = sprintf("^val_%s_.*\\.rds$", scaling_method),
full.names = TRUE
)
#1.3.2 Extract the validation index names from the filenames
indices <- basename(val_files) %>%
stringr::str_remove(sprintf("^val_%s_", scaling_method)) %>%
stringr::str_remove("\\.rds$")
#1.3.3 Read them all in and name the list
val_list <- purrr::map(val_files, readRDS)
names(val_list) <- indices
#1.4 Bind & summarize all index results for plotting
index_k_df <- purrr::map_dfr(
val_list,
function(vr) {
ks <- as.integer(names(vr$indices))
vals <- as.numeric(vr$indices)
tibble(index = vr$method, k = ks, value = vals)
}
)
#2. Generate elbow diagnostics & Kneedle (UIK)
#2.1 Prepare a data.frame for plotting
elbow_df <- tibble(
k = as.integer(names(wss_vec)),
wss = as.numeric(wss_vec)
)
#2.2 Create first basic elbow plot
#2.2.1 Generate basic plot
elbow_plot <- ggplot(elbow_df, aes(x = k, y = wss)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(
x = "Number of clusters (k)",
y = "Total within cluster sum of squares",
title = "Elbow plot: WSS vs k") +
theme_minimal(base_size = 14)
#2.2.2 Print the basic elbow plot
print(elbow_plot)
#2.3 Find the knee via the Kneedle (UIK) algorithm
knee <- uik(elbow_df$k, elbow_df$wss)
#2.3.1 Log the UIK elbow into the detailed log
cat(
sprintf(
"%s|UIK_ELBOW|%s|k=%d\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
scaling_method, knee),
file = log_file, append = TRUE)
#2.4 Create a more detailed elbow plot with UIK annotation
#2.4.1 Generate the elbow plot with the UIK
elbow_plot_uik <- ggplot(elbow_df, aes(x = k, y = wss)) +
geom_line(size = 1) +
geom_point(size = 3) +
geom_vline(
xintercept = knee,
linetype = "dashed",
size = 0.8,
color = "red") +
annotate(
"text",
x = knee,
y = max(elbow_df$wss) * 0.95,
label = paste0("UIK elbow:\nk = ", knee),
hjust = 0,
size = 5,
color = "red") +
labs(
x = "Number of clusters (k)",
y = "Total WSS",
title = "Elbow plot with Kneedle (UIK) selection") +
theme_minimal(base_size = 14)
#2.4.2 Print the elbow plot with the UIK
print(elbow_plot_uik)
#2.5 Cluster compactness / separation metrics per k
#2.5.1 Compute mixed type total sum of squares (TSS) once via 1-cluster prototype
#2.5.1.1 Calculate the kproto object
t1 <- kproto(
x = scaled_data,
k = 1,
nstart = 1,
verbose = FALSE)
#2.5.1.2 Store the total within SS value for the whole mixed dataset
TSS_mixed <- t1$tot.withinss
#2.5.2 Loop over each k to compute RMSSD & R2
cluster_metrics <- purrr::map_dfr(
k_vals,
function(k) {
#2.5.2.1 Extract the k-prototype fit for this k by position
position <- which(k_vals == k)
kp <- kp_list[[position]]
wss_cl <- kp$withinss
sz <- as.numeric(kp$size)
tot_wss <- kp$tot.withinss
#2.5.2.2 RMSSD: sqrt(mean( SS_cl / cluster_size ))
rmssd <- sqrt(mean(wss_cl / sz))
#2.5.2.3 R2: proportion of variance explained
r2 <- (TSS_mixed - tot_wss) / TSS_mixed
#2.5.2.4 Log to detailed log
cat(
sprintf(
"%s|METRICS|%s|k=%d|RMSSD=%.4f|R2=%.4f\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
scaling_method, k, rmssd, r2), file = log_file, append = TRUE)
#2.5.2.5 Return a one-row tibble with our metrics
tibble(
k = k,
RMSSD = rmssd,
R2 = r2)
}
)
#2.5.3 Print a summary table
knitr::kable(
cluster_metrics,
caption = glue::glue("Compactness/separation metrics per k ({scaling_method} scaling)"),
digits = 3) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| k | RMSSD | R2 |
|---|---|---|
| 2 | 4.309 | 0.164 |
| 3 | 4.015 | 0.233 |
| 4 | 3.837 | 0.278 |
| 5 | 3.853 | 0.308 |
| 6 | 3.715 | 0.333 |
| 7 | 3.619 | 0.352 |
| 8 | 3.539 | 0.368 |
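The R2 column can be spot-checked directly from the formula in step 2.5.2.3. The numbers below are hypothetical, chosen only to mirror the 0.164 reported for k = 2 in the table (not taken from the actual run):

```r
# Hypothetical values illustrating the variance-explained formula
TSS_mixed <- 100     # total mixed-type sum of squares (from the 1-cluster fit)
tot_wss   <- 83.6    # total within-cluster SS at some k

r2 <- (TSS_mixed - tot_wss) / TSS_mixed
r2  # 0.164
```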
#2.6 UMAP & blob diagnostics for selected k values
#2.6.1 Create a function to one hot encode categorical variables + pipe in scaled numerics
prepare_mixed_data <- function(data) {
cat_cols <- sapply(data, function(x) is.factor(x) || is.character(x))
num_cols <- sapply(data, is.numeric)
comps <- list()
if (any(cat_cols)) {
cd <- data[, cat_cols, drop = FALSE]
cd[] <- lapply(cd, function(x) if (is.character(x)) as.factor(x) else x)
comps$cat <- dummy_cols(cd, remove_first_dummy = TRUE, remove_selected_columns = TRUE)
}
if (any(num_cols)) {
nd <- data[, num_cols, drop = FALSE]
comps$num <- as.data.frame(nd)
}
do.call(cbind, comps)
}
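A toy call illustrating what prepare_mixed_data() produces: each factor expands to k - 1 dummy indicators via fastDummies::dummy_cols(), while numeric columns pass through untouched (toy data; assumes the fastDummies package and the function defined above):

```r
library(fastDummies)

# Toy mixed-type frame: one numeric column, one three-level factor
toy <- data.frame(
  score = c(0.1, 0.5, 0.9),
  group = factor(c("a", "b", "c"))
)

prepare_mixed_data(toy)
# The factor's first level ("a") is dropped (remove_first_dummy = TRUE),
# leaving group_b and group_c indicator columns alongside score
```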
#2.6.2 Choose the k's to visualize - current most likely optimal candidates
selected_k <- c(2, 3, 4)
#2.6.3 Loop over each selected k and create UMAP + diagnostics
for (k in selected_k) {
#2.6.3.1 Extract data & cluster labels by position
pos <- which(k_vals == k)
kp_obj <- kp_list[[pos]]
data_k <- kp_obj$data
cluster_labels <- factor(kp_obj$cluster)
#2.6.3.2 Prepare data_for_umap and run UMAP with tuned params
data_for_umap <- prepare_mixed_data(data_k)
umap_config <- umap.defaults
umap_config$n_neighbors <- 15
umap_config$min_dist <- 0.1
umap_config$spread <- 1.5
umap_config$n_epochs <- 500
#2.6.3.3 Generate the UMAP results
umap_result <- umap(data_for_umap, config = umap_config)
#2.6.3.4 Build a plotting dataframe
umap_df <- data.frame(
UMAP1 = umap_result$layout[,1],
UMAP2 = umap_result$layout[,2],
Cluster = as.factor(cluster_labels)
)
#2.6.3.5 UMAP scatter plot
p_umap <- ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = Cluster)) +
geom_point(size = 1.5, alpha = 0.6) +
labs(
title = glue::glue("UMAP Visualization of Clusters (k = {k})"),
subtitle = paste("n_neighbors =", umap_config$n_neighbors, ", min_dist =", umap_config$min_dist),
x = "UMAP Dimension 1",
y = "UMAP Dimension 2") +
theme_minimal() +
theme(
legend.title = element_text(size = 12),
plot.title = element_text(size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5)) +
guides(color = guide_legend(override.aes = list(alpha = 1, size = 3)))
print(p_umap)
#2.6.3.6 UMAP blob analysis plot
p_blob_analysis <- ggplot(umap_df, aes(x = UMAP1, y = UMAP2)) +
geom_density_2d(alpha = 0.3, color = "gray50") +
geom_point(aes(color = Cluster), size = 1.2, alpha = 0.7) +
stat_summary_2d(aes(z = as.numeric(Cluster)), fun = mean, alpha = 0.7, bins = 10) +
labs(
title = glue::glue("UMAP Blob Analysis (k = {k})"),
subtitle = "Even overlapping clusters can show meaningful structure",
x = "UMAP Dimension 1",
y = "UMAP Dimension 2") +
theme_minimal() +
theme(legend.title = element_text(size = 12))
print(p_blob_analysis)
#2.6.3.7 Individual cluster highlight plots
unique_clusters <- sort(unique(cluster_labels))
cluster_plots <- lapply(unique_clusters, function(cl_id) {
plot_data <- umap_df
plot_data$highlight <- ifelse(plot_data$Cluster == cl_id, paste("Cluster", cl_id), "Other")
#2.6.3.8 Plot the cluster highlights
ggplot(plot_data, aes(x = UMAP1, y = UMAP2)) +
geom_point(aes(color = highlight, alpha = highlight), size = 1) +
# "Cluster <id>" sorts alphabetically before "Other", so the highlight value comes first
scale_color_manual(values = c("red", "gray80")) +
scale_alpha_manual(values = c(0.8, 0.3)) +
labs(title = paste("Cluster", cl_id, "Distribution")) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(size = 10),
axis.text = element_text(size = 8)
)
})
#2.6.3.9 Arrange up to 6 highlight plots in a 2-column grid; otherwise print individually
if (length(cluster_plots) <= 6) {
do.call(grid.arrange, c(cluster_plots, ncol = 2, top = glue::glue("k = {k} Individual Cluster Patterns")))
} else {
print(cluster_plots)
}
}
#3. Summarize & visualize validation indices & pick a consensus k
#3.1 Define which direction is better for each index
dir_tbl <- tribble(
~index, ~direction, ~label,
"silhouette", "higher", "Silhouette (higher = better)",
"cindex", "lower", "C-index (lower = better)",
"dunn", "higher", "Dunn (higher = better)",
"gamma", "higher", "Gamma (higher = better)",
"ptbiserial", "higher", "Point-biserial (higher = better)",
"tau", "higher", "Tau (higher = better)"
)
#3.2 Pull out full index vs k values from cached validation objects
#3.2.1 Build a single data.frame with index, k, and value
index_k_df <- purrr::map2_dfr(
indices, val_list,
function(idx, vr) {
ks <- as.integer(names(vr$indices))
vals <- as.numeric(vr$indices)
tibble(index = idx, k = ks, value = vals)
}
)
#3.2.2 Now filter dir_tbl to only the indices we actually have for the current run
dir_tbl2 <- dir_tbl %>% filter(index %in% index_k_df$index)
#3.2.3 Join direction & label onto the data
index_k_df <- index_k_df %>%
left_join(dir_tbl2, by = "index")
#3.3 Compute each index's best k
summary_tbl <- index_k_df %>%
group_by(index, direction, label) %>%
summarise(
k_opt = if (direction[1] == "lower") k[which.min(value)] else k[which.max(value)],
index_opt = if (direction[1] == "lower") min(value) else max(value),
.groups = "drop"
)
#3.3.1 Determine consensus k (at least according to simple vote)
consensus_k <- summary_tbl$k_opt %>% table() %>% which.max() %>% names() %>% as.integer()
#3.3.2 Print and log the "consensus k"
cat(sprintf("%s|CONSENSUS_K|%s|%d\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
scaling_method, consensus_k),
file = log_file, append = TRUE)
#3.4 Create a line plot of all indices vs k
#3.4.1 Generate the line plot for each index x k value
index_line_plot <- ggplot(index_k_df, aes(x = k, y = value, color = label)) +
geom_line(size = 1) +
geom_point(size = 2) +
geom_vline(xintercept = consensus_k, linetype = "dashed", color = "grey50") +
annotate(
"text", x = consensus_k + 0.2,
y = max(index_k_df$value)*0.95,
label = paste0("Consensus k = ", consensus_k),
hjust = 0, size = 5) +
labs(
x = "Number of clusters (k)",
y = "Validation-index value",
color = "Index & direction",
title = glue::glue("Validation-index trajectories ({scaling_method} scaling)")) +
theme_minimal(base_size = 14)
#3.4.2 Print the line plot for each index x k value
print(index_line_plot)
#3.5 Create a heatmap of index values x each k
#3.5.1 Generate the heatmap
index_heatmap <- ggplot(index_k_df, aes(x = factor(k), y = label, fill = value)) +
geom_tile(color = "white") +
scale_fill_viridis_c(option = "magma") +
labs(
x = "k",
y = NULL,
fill = "Value",
title = "Heatmap of validation-index values") +
theme_minimal(base_size = 14) +
theme(axis.text.x = element_text(vjust = 0.5))
#3.5.2 Print the heatmap
print(index_heatmap)
#3.6 List the tabular summary of each index's optimal k & value
summary_tbl %>%
arrange(label) %>%
dplyr::select(label, k_opt, index_opt) %>%
knitr::kable(
caption = glue::glue("Optimal k per index ({scaling_method} scaling)"),
col.names = c("Index (direction)", "optimal k", "Value"),
digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
| Index (direction) | optimal k | Value |
|---|---|---|
| Silhouette (higher = better) | 2 | 0.518 |
Interpretation:
The elbow plot suggests a bend at k = 4, reinforced by Kneedle's quantitative estimate
Only one validation index (Silhouette, higher = better) completed within the run-time limit; it votes for k = 2
Taken together, these results suggest that k = 2 most consistently balances cohesion and separation in our risk variable data, pending contextualization of these results once the remaining steps of this script are complete